TOPOLOGICAL OPTIMIZATION OF THE EVALUATION OF 
FINITE ELEMENT MATRICES 



ROBERT C. KIRBY *, ANDERS LOGG t, L. RIDGWAY SCOTT t, AND 
ANDY R. TERREL § 

Abstract. We present a topological framework for finding low-flop algorithms for evaluating 
element stiffness matrices associated with multilinear forms for finite element methods posed over 
straight-sided affine domains. This framework relies on phrasing the computation on each element 
as the contraction of each collection of reference element tensors with an element-specific geometric 
tensor. We then present a new concept of complexity-reducing relations that serve as distance 
relations between these reference element tensors. This notion sets up a graph-theoretic context in 
which we may find an optimized algorithm by computing a minimum spanning tree. We present 
experimental results for some common multilinear forms showing significant reductions in operation 
count and also discuss some efficient algorithms for building the graph we use for the optimization. 
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1. Introduction. Several ongoing projects have led to the development of tools 
for automating important aspects of the finite element method, with the potential for 
increasing code reliability and decreasing development time. By developing libraries 
for existing languages or new domain-specific languages, these software tools allow 
programmers to define variational forms and other parts of a finite element method 
with succinct, mathematical syntax. Existing CH — h libraries for finite elements include 
DOLFIN [3 [7], Sundance [Tni E] and deal. II [2J. These rely on some combination 
of operator overloading and object orientation to present a high-level syntax to the 
user. Other projects have defined new languages with their own grammar and syntax, 
such as FreeFEM [5U], GetDP [S], and Analysa [1 . Somewhere between these two 
approaches is the FEniCS Form Compiler, FFC (T21 [TS] developed primarily by the 
second author. This Python library relies on operator overloading to define variational 
forms, but rather than using on the Python interpreter to evaluate the forms, FFC 
generates low-level code that can be compiled into other platforms such as DOLFIN. 

While these tools are effective at exploiting modern software engineering to pro- 
duce workable systems, we believe that additional mathematical insight will lead to 
even more powerful codes with more general approximating spaces and more power- 
ful algorithms. To give one example, work by the first author [9J [TU] shows how the 
Ciarlet definition of the finite element leads to a code for arbitrary order elements 
of general type. This code, FIAT, is used by FFC and is currently being integrated 
with Sundance. For further examples, we refer to work by the first three authors and 
Knepley [TTJ [12], which studies how to efficiently (in the sense of operation count) 
evaluate local stiffness matrices for finite element methods. For a given variational 
form, for each element e, there exists a "geometric tensor" G e . Each entry of the 
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n x n element stiffness matrix for e may be written as the contraction of a specific 
reference element tensor (depending only on the form, the polynomial basis and the 
reference element) with G e . We saw that there are relations between many of the 
reference element tensors (equality, colinearity, small Hamming distance) that may 
be exploited to find an algorithm with significantly fewer floating point operations 
than standard techniques. 

In [11], we devised a crude algorithm that searched for and exploited these 
dependencies, generating simple Python code for evaluating the local stiffness matrix 
for the Laplacian on triangles given the geometric tensor. In this paper, we set the 
optimization process in a more abstract graph/topological context. In Section [21 
we express the calculation for some finite element operators as tensor contractions. 
These ideas are more fully developed in |13j . where the first two authors show how 
this tensor structure may be used to implement a compiler capable of generating 
the reference tensors and hence code for building the stiffness matrices for general 
multilinear forms on affine elements. Further, a generalization to non-afnne mappings, 
including curvilinear elements, is suggested. In Section [3j we introduce the idea of 
complexity-reducing relations, which are a kind of distance relations among the tensors 
that serve to model the cost of complexity in computing the contractions. Using 
these ideas, we show how to derive an algorithm for performing the computation 
that is optimal in a certain sense. We demonstrate the reduction in operation count 
for several finite element operations in Section 01 After this, we show how these 
efficient algorithms may be derived more efficiently through the use of certain search 
procedures in Section[5] Finally, we state some conclusions and remarks about ongoing 
work in Section [6[ 

2. Finite element formulation. The finite element method is a general method- 
ology for the discretization of differential equations. A linear (or linearized) differential 
equation for the unknown function u is expressed in the form of a canonical variational 
problem and the discrete approximation U of u is sought as the solution of a discrete 
version of the variational problem [U [3] : Find U £ V such that 

a(v,U) = L(v) VveV, (2.1) 

where (V, V) is a pair of suitable discrete (typically piecewise polynomial) function 
spaces, a : V x V — > R a bilinear form and L:F^la linear form. 

The variational problem (|2.ip corresponds to a linear system At; = b for the 
expansion coefficients £ £ R M of the discrete function U in a basis {<£i}fL\ for V. 
If {<f i}f=i is the corresponding basis for V, the entries of A and b are given by 
Aij — a((f>i,(pj) and bi = L(ipi) respectively. When we consider general multilinear 
forms below, the multilinear form a is represented by a tensor A. 

2.1. Multilinear forms. Let now {Ui}[ =1 be a given set of discrete function 
spaces defined on a triangulation T = {e} of 12 C R d . We consider a general multi- 
linear form a defined on the product space V\ X V2 X • • • X V r : 

a : Vi x V 2 x ■ • • x V r -t R. (2.2) 

Typically, r = 1 (linear form) or r = 2 (bilinear form) , but forms of higher arity appear 
frequently in applications and include variable coefficient diffusion and advection of 
momentum in the incompressible Navier-Stokes equations. 
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Let {¥>?}<=!, • • • . Wl)T=i be bases of V x , V 2 , . . . , V r and let i = (n,t a , 

be a multiindex. The multilinear form a then defines a rank r tensor given by 

Ai = <!(<&,<&,..., (2.3) 

In the case of a bilinear form, the tensor A is a matrix (the stiffness matrix), and in 
the case of a linear form, the tensor A is a vector (the load vector) . 

To compute the tensor A by assembly, we need to compute the element tensor A e 
on each element e of the triangulation T of Q [T3]. Let {^f' 1 }^! be the restriction to 
e of the subset of {tpj } i= j supported on e and define the local bases on e for V2, • • • , V r 
similarly. The rank r element tensor A e is then given by 

A? =a e (<p% L ,<pg,.. (2-4) 

2.2. Evaluation by tensor representation. The element tensor A e can be 
efficiently computed by representing A e as a special tensor product. If the multilinear 
form a is given by an integral over the domain f2, then each entry A\ of A e is given 
by an integral over the element e. By a change of variables and a series of linear 
transformations, we may rewrite this integral as an integral over a reference element 
E. In particular, when the map from the reference element E is affine, the linear 
transformations of derivatives can be moved outside of the integral to obtain a repre- 
sentation of the element tensor A e as a tensor product of a constant tensor A and a 
tensor G e that varies over the set of elements, 

A\ = A\ a G1, (2.5) 

or more generally a sum A\ — A^G° k of such tensor products, where i and a are 
multiindices and we use the convention that repetition of an index means summation 
over that index. We refer to A as the reference tensor and to G e as the geometric 
tensor. The rank of the reference tensor is the sum of the rank r — \i\ of the element 
tensor and the rank |a| of the geometric tensor G e . As we shall see, the rank of the 
geometric tensor depends on the specific form. 

In [13) , we present an algorithm that computes the tensor representation (|2.5[) for 
fairly general multilinear forms. This algorithm forms the foundation for the FEniCS 
Form Compiler, FFC [T3] . 

As an example, we consider here the tensor representation of the element tensor 
A e for Poisson's equation —Au(x) — f(x) with homogeneous Dirichlet boundary 
conditions on a domain f2. The bilinear form a is here given by a{v,u) — J n 'SZv(x) ■ 
V'u(ar) dx and the linear form L is given by L(v) — Jn v{x)f{x) dx. By a change of 
variables using an affine map F e : E — > e, we obtain the following representation of 
the element tensor A e : 

dxp dxp J E dX ai dX a . 2 



(2.6) 



where A a = f 9 ^ {X) a ^ x) dx q* = detF /^£-L^2. and $J = p 

for j ~ 1,2. We see that the reference tensor A is here a rank four tensor and the 
geometric tensor G e is a rank two tensor. In [llj . we saw that dependencies frequently 
occur between A? and A® for many multiindices i,j that can reduce the overall cost 
of computation. 
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3. Optimizing stiffness matrix evaluation. 

3.1. An abstract optimization problem. Tensors of arbitrary rank may be 
rewritten as vectors, with contraction becoming the Euclidean inner product. So, we 
will just deal with vectors and inner products to formulate our abstract optimization 
problem. We let Y = {y l }2=i be a collection of n vectors in E m . In the most general 
case, this is a collection rather than a set, as some of the items in Y may be identical. 
Corresponding to Y, we must find a process for computing for arbitrary g E K m the 
collection of items {(y l ) <?}™=i- Throughout this paper, we will measure the cost of 
this as the total number of multiply-add pairs (MAPs) required to complete all the 
dot products. This cost is always bounded by nm, but we hope to improve on that. 
This could be alternatively formalized as building an abstract control-flow graph for 
performing the dot products that is equivalent to the naive process but contains a 
minimal number of nodes. Our techniques, however, rely on structure that is not 
apparent to traditional optimizing compilers, so we prefer the present formulation. 

We seek out ways of optimizing the local matrix evaluation that rely on notions 
of distance between a few of the underlying vectors. The Euclidean metric is not 
helpful here; we focus on other, discrete measures of distance such that if y and z are 
close together, then y l g is easy to compute once z t g is known (and vice versa). Many 
of the dependencies we considered in were between pairs of vectors — equality, 
colinearity, and Hamming distance. Here, we develop a theory for optimizing the 
evaluation of finite clement matrices under binary relations between the members of 
the collection. This starts by introducing some notions of distance on the collection of 
vectors and finds an optimized computation with respect to those notions by means 
of a minimum spanning tree. 

3.2. Complexity-reducing relations. Definition 3.1. Let p : Y x Y — >• 

[0, to] be symmetric. We say that p is complexity-reducing if for every y,z G Y 
with p(y, z) < k < m, y l g may be computed using the result z l g in no more than k 
multiply-add pairs. 

Example 1. Let e + {y.z) = d(l — S yz ), where 5 y z is the standard Kronecker 
delta. Then, e + is seen to be complexity-reducing, for if e + (y, z) — 0, then y t g = z l g 
for all g € R m and hence the former requires no arithmetic once the latter is known. 
Similarly, we can let e~(u,v) — e + (u.~ v), for if u = —v, then computing u l g from 
v l g requires only a sign flip and no further floating point operations. 

Example 2. Let 

(0, y = z 

c(y,z)=< 1, y = az for some a € K, a =^ 0, 1 (3-1) 
I to, otherwise 

Then c is complexity-reducing, for y l g — {az) l g — a(z g), so y l g may be computed 
with one additional floating point operation once z l g is known. 

Example 3. Let H + (y, z) be the Hamming distance, the number entries in which 
y and z differ. Then H + is complexity-reducing. If H + (y, z) — k, then y and z differ 
in k entries, so the difference y — z has only k nonzero entries. Hence, (y — z) g costs 
k multiply-add pairs to compute, and we may write y l g = (y — z) l g + z l g. By the 
same argument, we can let H~(y, z) = H + (y, —z). 

Theorem 3.2. Let p\ and pi be complexity-reducing relations. Define 



p(y,z) = mm(p 1 (y,z),p 2 {y,z)). 



(3.2) 
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Then p is a complexity-reducing relation. 

Proof. Pick y, z e Y, let 1 < i < 2 be such that p(y,z) = pi(y,z) and let 
Pi(y,z) = k. But pi is a complexity-reducing relation, so for any g G R m , y'g may 
be computed in no more than k = p{y,z) multiply-add pairs. Hence p is complexity- 
reducing. □ 

This simple result means that we may consider any finite collection of complexity- 
reducing relations (e.g. colinearity together with Hamming distance) as if they were 
a single relation for the purpose of finding an optimized computation in the later 
discussion. 

Definition 3.3. If p is a complexity-reducing relation defined as the minimum 
over a finite set of complexity-reducing relations, we say that it is composite. If it is 
not, we say that it is simple. 

To see that not all complexity-reducing relations are metrics, it is easy to find 
a composite complexity-reducing that violates the triangle inequality. Let p(jj, z) = 
min(iJ + (y, z), c(y, z)). It is not hard to show that H + and c are both metrics. If 
we take y = (1, 2, 2)' and z = (0, 4, 4)*, then p(y, z) = 3 since the vectors are neither 
colincar nor share any common entries. However, if we let x = (0, 2, 2)', then p(y, x) = 
1 since the vectors share two entries and p(x, z) = 1 by colinearity. Hence, p(y, z) > 
p(y, x) + p(x, z), and the triangle inequality fails. 

Remark 1. It may be possible to enrich the collection of vectors with Steiner 
points to reduce the overall cost of the computation. For example, if the collection of 
vectors {(1, 2, 2)*, (0, 4, 4)*} above were enriched with the additional vector (0,2,2)', 
the cost of the dot products with arbitrary g would be reduced from six multiply-add 
pairs (three for each vector) to five (three for the first vector, then one to compute the 
dot products for each of the other two), even though there are more overall vector dot 
products to perform. 

3.3. Finding an optimized computation. Having defined complexity-reducing 
relations and given several examples, we now show how they may be used to deter- 
mine an optimized evaluation of the stiffness matrix. We shall work in the context of 
a single complexity-reducing relation p, which may be composite. 

In order to compute {(y l )'<?}™ = i, we would like to pick some y l e Y and compute 
(y l ) ) g. Then, we want to pick some y 3 that is very close to y l under p and then 
compute (y J Yg. Then, we pick some y k close to either y % or y 3 and compute that dot 
product. So, this process continues until all the dot products have been computed. 
Moreover, since the vectors Y depend only on the variational form and finite element 
space and not the mesh or parameters, it is possible to do this search once offline and 
generate low-level code that will exploit these relations. We first formalize the notion 
of finding the optimal computation and then how the code may be generated. 

We introduce a weighted, undirected graph G = (Y, E) where Y is our collection 
of vectors defined above. Our graph is completely connected; that is, every pair of 
vectors y x ,y 3 are connected by an edge. The weight of this edge is defined to be 
p{y l ,y 3 ). We may think of walking along the edge from y % to y 3 as using the dot 
product {y l ) l g to compute (y 3 Yg. If p is composite, it will be helpful to know later 
which underlying relation gave the minimum value used for p. So, suppose that 
p(y, z) — min{p.i(y, z)}fL 1 . For any fixed y, z, let g(y, z) be a in integer in [1, R] such 
that p e (y !Z )(y, z) = p(y,z). In addition to weights, we thus associate with each edge 
{y\y J } the entity g{y\y 3 ). 

A standard graph-theoretic object called a minimum spanning tree is exactly 
what we need [14j . A spanning tree, which we shall denote (Y, E') is a subgraph that 
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satisfies certain properties. First, it contains all of the n nodes of the original graph. 
Second, (Y, E') is connected. Finally, E' has n — 1 edges, so that there are no cycles 
(thus it is a tree). Now, there are possibly many spanning trees for a given graph. 
Every spanning tree has a weight associated with it that is the sum of the weights 
of its edges. A minimum spanning tree is a spanning tree such that the sum of the 
edge weights is as small as possible. Minimum spanning tree algorithms start with 
a particular node of the graph, called the root. Regardless of which root is chosen, 
minimum spanning tree algorithms will generate trees with exactly the same sum of 
edge weights. 

While technically a minimum spanning tree is undirected, we can think of it as 
being a directed graph with all edges going away from the root. Such a notion tells 
us how to compute all of the dot products with minimal operations with respect to 
p. We start with the root node, which we assume is y , and compute (y )*^. Then, 
for each of the children of y° in the tree, we compute the dot products with g using 
the result of (y°)*g. Then, we use the dot products of the children to compute the 
dot products of each of the children's children, and so on. This is just a breadth-first 
traversal of the tree. A depth-first traversal of the tree would also generate a correct 
algorithm, but it would likely require more motion of the computed results in and out 
of registers at run-time. 

The total number of multiply- add pairs in this process is m for computing (t/ )*<? 
plus the sum of the edge weights of (Y, E'). As the sum of edge weights is as small as 
it can be, we have a minimum-cost algorithm for computing {(y l )*<?}™ =1 with respect 
to p for any g G R m . On the other hand, it is not a true optimal cost as one could 
find a better p or else use relations between more than two vectors (say three coplanar 
vectors). One other variation is in the choice of root vector. If, for example some 
y % has several elements that are zero, then it can be dotted with g with fewer than 
m multiply add pairs. Hence, we pick some y G Y such that the number of nonzero 
entries is minimal to be the root. We summarize these results in a theorem: 

Theorem 3.4. Let G — (Y,E) be defined as above and let g G W n be arbitrary. 
The total number of multiply-add pairs needed to compute {(y l )'<?}™ = i is no greater 
than m! + w, where m! is the minimum number of nonzero entries of a member of Y 
and w is the weight of a minimum spanning tree of G. 

The overhead of walking through the tree at runtime would likely outweigh the 
benefits of reducing the floating point cost. We can instead traverse the tree and 
generate low-level code for computing all of the dot products - this function takes 
as an argument the vector g and computes all of the dot products of Y with g. An 
example of such code was presented in . 

3.4. Comparison to spectral elements. Our approach is remarkably differ- 
ent than the spectral element method. In spectral element methods, one typically 
works with tensor products of Lagrange polynomials over logically rectangular do- 
mains. Efficient algorithms for evaluating the stiffness matrix or its action follow 
naturally by working dimension-by-dimension. While such decompositions are possi- 
ble for unstructured shapes [5], these are restricted to specialized polynomial bases. 
On the other hand, our approach is blind both to the element shape and kind of 
approximating spaces used. While spectral element techniques may ultimately prove 
more effective when available, our approach will enable some level of optimization 
in more general cases, such as Raviart-Thomas-Nedelec [21] [22j [JHl US] elements on 
tetrahedra. 
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Table 4.1 

Element matrix indices and associated tensors (the slice A? for each fixed index i) displayed 
as vectors for the Laplacian on triangles with quadratic basis functions. All vectors are scaled by six 
so they appear as integers. 



index 


vector 


(0, 0) 


3 


3 


3 


3 


(0, i) 


1 





1 





(0, 2) 





1 





1 


(0, 3) 














(0, 4) 





-4 





-4 


(0, 5) 


-4 





-4 





(1,0) 


1 


1 








(1, 1) 


3 











(1,2) 





-1 








(1,3) 





4 








(1,4) 














(1, 5) 


-4 


-4 









index 


vector 


(2, 0) 








1 


1 


(2, 1) 








-1 





(2, 2) 











3 


(2, 3) 








4 





(2, 4) 








-4 


-4 


(2, 5) 














(3, 0) 














(3, 1) 








4 





(3, 2) 





4 








(3, 3) 


8 


4 


4 


8 


(3, 4) 


-8 


-4 


-4 





(3, 5) 





-4 


-4 


-8 



index 


vector 


(4, 0) 


0-4-4 


(4, 1) 





(4, 2) 


0-4 0-4 


(4, 3) 


-8 -4 -4 


(4, 4) 


8 4 4 8 


(4, 5) 


4 4 


(5, 0) 


-4-4 


(5, 1) 


-4 0-4 


(5, 2) 





(5, 3) 


-4 -4 -8 


(5, 4) 


4 4 


(5, 5) 


8 4 4 8 



4. Experimental results. Here, we show that this optimization technique is 
successful at generating low-flop algorithms for computing the element stiffness ma- 
trices associated with some standard variational forms. First, we consider the bilinear 
forms for the Laplacian and advection in one coordinate direction for tetrahedra. Sec- 
ond, we study the trilinear form for the weighted Laplacian. In all cases, we generated 
the element tensors using FFC, the FEniCS Form Compiler [13l [15], which in turn 
relies on FIAT [9] [10] to generate the finite element basis functions and integration 
rules. Throughout this section, we let d ~ 2, 3 refer to the spatial dimension of Q. 

4.1. Laplacian. We consider first the standard Laplacian operator 

a(v,u)= / Vd(i) • Vu(x) dx. (4.1) 
Jn 

We gave a tensor product representation of the local stiffness matrix in equation (|2.6j) . 
The indices of the local stiffness matrix and the associated tensors are shown in 
Table S3] 

Because the clement stiffness matrix is symmetric, wc only need to build the tri- 
angular part. Even without any optimization techniques, this naturally leads from 
computing \P\ 2 contractions at a cost of d 2 multiply-add pairs each to (' P 2 +1 ) contrac- 
tions, where \P\ is the dimension of the polynomial space P. Beyond this, symmetry 
opens up a further opportunity for optimization. For every element e, G e is symmetric. 
The equality of its off-diagonal entries means that the contraction can be performed 
in (J 1 ) rather than d 2 entries. This is readily illustrated in the two-dimensional case. 
We contract a symmetric 2x2 tensor G with an arbitrary 2x2 tensor K: 

q. X — ( ^11 ^* 12 ] . ( K\\ K 12 \ 
V G12 G22 J \ K 2 i K 2 2 J 

= G n K u + G 12 (K U + K 2 i) + G22K22 (4 ' 2) 

= &K, 



where & = (Gu, G12, G 22 ) and K l = (#11, #12 + #21, #22). 
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Table 4.2 

Element matrix indices and associated tensors (the slice A® for each fixed index i) for the 
Laplacian on triangles with quadratic basis functions after transformation to make use of symmetry. 





Vet \ikjL 


(0, 0) 


3 6 3 


(0, i) 


1 1 


(0, 2) 


1 1 


(0, 3) 





(0, 4) 


-4 -4 


(0, 5) 


-4 -4 


(1, 1) 


3 


(1,2) 


0-10 


(1,3) 


4 


(1,4) 





(1,5) 


-4 -4 



index 


vector 


(2, 2) 


3 


(2, 3) 


4 


(2, 4) 


-4 -4 


(2, 5) 





(3, 3) 


8 8 8 


(3, 4) 


-8 -8 


(3, 5) 


-8 -8 


(4, 4) 


8 8 8 


(4, 5) 


8 


(5, 5) 


8 8 8 



This simple calculation implies a linear transformation : from M. dxd into 
obtained by taking the diagonal entries of the matrix together with the sum of the 
off-diagonal entries, that may be applied to each reference tensor, together with an 
associated mapping ~ on symmetric tensors that just takes the symmetric part and 
casts it as a vector. Hence, the overall cost of computing an element stiffness matrix 
before optimizations goes from \P\ 2 d 2 to (' P 2 +1 ) (f^ 1 ) ■ 

An interesting property of this transformation of the reference tensor is that 
it is contractive for the complexity-reducing relations we consider. The Hamming 
distance between two items under • is bounded by the Hamming distance between 
the items. More precisely, p(y, z) < p(y, z). Furthermore, if items are colinear before 
transformation, their images will be as well. So, for the optimizations we consider, 
we will not destroy any dependencies. Moreover, the transformation may introduce 
additional dependencies. For example, before applying the transformation, entries 
(0,1) and (1,5) are not closely related by Hamming distance or colinearity, as seen in 
Table |4~T1 However, after the transformation, we see that the same items in Table I4T21 
are colinear. Other examples can be found readily. 

We optimized the evaluation of the Laplacian for Lagrange finite elements of 
degrees one through three on triangles and tetrahedra using a composite complexity- 
reducing relation with H + , H~ , and c defined in Examples [2] and [3l We performed 
the optimization both with and without symmetry. The results are shown in Ta- 
bles 331 and 2131 Figure |4~T1 shows a diagram of the minimum spanning tree computed 
by our code for the Laplacian on quadratic elements using symmetry. Each node 
of the graph is labeled with a pair indicating the matrix entry (the vectors 

themselves are displayed in Table [472]) . and the edges are labeled with the associated 
weights. Simple inspection reveals that the sum of the edge weights is 14, which when 
added to 3 to compute the dot product for the root node, agrees with the entry for 
quadratics in Table B~41 

These techniques are successful in reducing the flop count, down to less than one 
operation per entry on triangles and less than two on tetrahedra for quadratic and 
cubic elements. We showed in pj] for low degree elements on triangles that going from 
standard numerical quadrature to tensor contractions led to a significant reduction in 
actual run-time for matrix assembly. From the tensor contractions, we got another 
good speedup by simply omitting multiplication by zeros. From this, we only gained 
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Fig. 4.1. Minimum spanning tree for optimized computation of the Laplacian using quadratic 
elements on triangles. The node labeled (3,3) is the root, and the flow is from bottom to top. 



a modest additional speedup by using our additional optimizations. However, this is 
most likely due to the relative costs of memory access and floating point operations. 
We have to load the geometric tensor from memory and we have to write every entry 
of the matrix to memory. So then, n + to in the tables gives a lower bound on memory 
access for computing the stiffness matrix. Our optimizations lead to algorithms for 
which there are a comparable number of arithmetic and memory operations. Hence, 
our optimization has succeeded in reducing the cost of computing the local stiffness 
matrix to a small increment to the cost of writing it to memory. 

4.2. Advection in one coordinate direction. Now, we consider constant 
coefficient advection aligned with a coordinate direction 

a(v,u)= [ v{x)^^-dx. (4.3) 
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Table 4.3 

Number of multiply-add pairs in the optimized algorithm for computing the Laplacian element 
stiffness matrix on triangles and tetrahedra for Lagrange polynomials of degree one through three 
without using symmetry. 



triangles 



tetrahedra 



degree 


n 


m 


nm 


MAPs 


1 


9 


4 


36 


13 


2 


36 


4 


144 


25 


3 


100 


4 


400 


74 



degree 


n 


m 


nm 


MAPs 


1 


16 


9 


144 


43 


2 


100 


9 


900 


205 


3 


400 


9 


3600 


864 



Table 4.4 

Number of multiply-add pairs in the optimized algorithm for computing the Laplacian element 
stiffness matrix on triangles and tetrahedra for Lagrange polynomials of degree one through three 
using symmetry. 



triangles 



tetrahedra 



degree 


n 


777 


nm 


MAPs 


1 


6 


3 


18 


9 


2 


21 


3 


63 


17 


3 


55 


3 


165 


46 



degree 


n 


777 


nm 


MAPs 


1 


10 


6 


60 


27 


2 


55 


6 


330 


101 


3 


210 


6 


1260 


370 



This is part of the operator associated with constant coefficient advection in some 
arbitrary direction — optimizing the other coordinate directions would give similar 
results. These results are shown in Table 14.51 Again, our optimization generates 
algorithms for which the predominant cost of computing the element matrix is writing 
it down, as there is significantly fewer than one floating point cycle per matrix entry 
in every case. 

4.3. Weighted Laplacian. Our final operator is the variable coefficient Lapla- 
cian: 

a w (v, u) — / w(x)\7v(x) • Vu(x) dx. (4-4) 



Jn 

This form may be viewed as a trilinear form a(w, v, u) in which w is the projec- 
tion of the coefficient into the finite element space. For many problems, this can be 
performed without a loss in order of convergence. For nonlinear problems, we will 
have to reassemble this form at each nonlinear iteration, so it is an important step to 
optimize. The reference tensor is 



E 



and the geometric tensor is 



Gt = w ai detF' e 9 ^ d ^ . (4.6) 

OXp OX/3 



Note that the geometric tensor is the outer product of the geometric tensor for the 
constant coefficient Laplacian, which we shall denote (G L ) e , with the coefficients of 
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Table 4.5 

Number of multiply- add pairs in the optimized algorithm for computing the coordinate- advection 
element stiffness matrix on triangles and tetrahedra for Lagrange polynomials of degree one through 
three. 



triangles 



tetrahedra 



degree 


n 


m 


nm 


MAPs 


1 


9 


2 


18 


4 


2 


36 


2 


72 


22 


3 


100 


2 


200 


59 



degree 


n 


m 


nm 


MAPs 


1 


16 


3 


48 


9 


2 


100 


3 


300 


35 


3 


400 


3 


1200 


189 



the weight function. Unlike the constant coefficient case, the amount of arithmetic 
per entry in the element stiffness matrix grows with the polynomial degree. 

We could simply proceed with the optimization as we did for other forms — the 
element tensor is just a collection of vectors that will have to be dotted into G e for 
each element in the mesh, but now the dot products are more expensive since the 
vectors are longer. In this case, G e must be explicitly formed for each element (this 
costs \P\cP once (G L ) e is formed). On the other hand, we could use the decomposition 
of G e into (G L ) e and the coefficient vector uut and do the contractions in stages. For 
example, we could organize the contraction as 

* = (4(^,^3) (G L ) ( r a3) ) Wai , (4.7) 

that is, for each of the \P\ 2 entries of the stiffness matrix, we compute \P\ contractions 
of d x d tensors with {G L ) e - This is a similar optimization problem as the Laplacian, 
but with \P\ times more vectors to optimize over. After we do this set of computations, 
we must compute \P\ 2 dot products with the coefficient vector Wk- Note that the 
contractions with (G L ) e may be optimized, but the resulting vectors to dot with 
Wk will not be known until run-time and must be computed at full cost. Hence, a 
lower bound for this approach is \P\ multiply-add pairs per entry (assuming that the 
contractions with G L were absolutely free). On the other hand, one could contract 
with the coefficient first to give an array of \P\ 2 tensors of size d x d, and then contract 
each of these with (G L ) e . As before, the first step can be optimized, but the second 
step cannot. 

In any of these cases, we may use the same transformations to exploit symmetry 
as we did in the constant coefficient case. Since G" 1,a2 ' a3 = G" 1 '" 3 012 , we may view 
each slice A®, of the reference tensor as an array of \P\ tensors of size d x d and apply 
the transformation to each of these. If we fully form G e , this reduces the cost from 
\P\d 2 to I-? 5 ! ( d J 1 ) ■ I n au °f our experiments, we made use of this. 

In Table 14. 6[ we see the cost of computing the weighted Laplacian by the first 
approach (optimizing directly the tensor product Af = A^ a G"). While the opti- 
mizations are not as successful as for the constant coefficient operators, we still get 
reductions of 30%-50% in the operation counts. 

When we perform the contraction in stages, we find more dependencies (for ex- 
ample, the slices of two of the tensors could be colinear although the entire tensors 
are not). We show the cost of performing the optimized stage for contracting with 
(G L ) e first in Table 14771 and for contracting with Wk first in Table l4~8l 

In order to get a fair comparison between these approaches, we must factor in 
the additional costs of building G e or performing the second stage of contraction. 
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Table 4.6 

Number of multiply-add pairs in the optimized algorithm for computing the weighted Laplacian 
element stiffness matrix on triangles and tetrahedra for Lagrange polynomials of degree one through 
three using symmetry. 



triangles 



tetrahedra 



degree 


n 


m 


nm 


MAPs 


1 


6 


9 


54 


27 


2 


21 


18 


378 


218 


3 


55 


30 


1650 


1110 



degree 


n 


m 


nm 


MAPs 


1 


10 


24 


240 


108 


2 


55 


60 


3300 


1650 


3 


210 


120 


25200 


14334 



Table 4.7 

Number of multiply-add pairs in the optimized algorithm for performing all of the contractions 
with (G L ) e in the weighted Laplacian on triangles and tetrahedra first, resulting in (' i3 ^~ 1 ) arrays 
of length \P\ to contract with w^. 

triangles 



tetrahedra 



degree 


n 


m 


nm 


MAPs 


1 


18 


3 


54 


9 


2 


126 


3 


378 


115 


3 


550 


3 


1650 


683 



degree 


n 


m 


nm 


MAPs 


1 


40 


6 


240 


27 


2 


550 


6 


3300 


693 


3 


4200 


6 


25200 


7021 



Once {G L ) e is built and symmetrized, it costs an additional l-PK^ 1 ) multiply-add 
pairs to construct G e . If we optimize the computation of contracting with (G L ) e 
first, we do not have to build G e , but we must perform a dot product with Wk for 
each entry of the matrix. This costs \P\ per contraction with (' P 2 +1 ) entries in the 
matrix. If we optimize the contraction with each first, then we have an additional 
(' P 2 +1 ) contractions with (G L ) e at a cost of ( d ~^ 1 ) each. We expect that which of 
these will be most effective must be determined (automatically) on a case-by-case 
basis. Table [4791 shows the comparisons for the first approach (labeled G e ), the second 
approach (labeled (G L ) e ) and the third approach (labeled Wk) by indicating the cost 
of the optimized computation plus the additional stages of computation. In most of 
these cases, contracting with the coefficient first leads to the lowest total cost. 

5. Optimizing the optimization process. Since our graph (Y,E) is com- 
pletely connected, we have \E\ = 0(\Y\ 2 ) and our optimization process requires com- 
plexity that is at least quadratic in the number of entries in the element stiffness 
matrix. In this section, we show how certain useful complexity-reducing relations 
may be evaluated over all of Y in better than quadratic time, then discuss how we 
may build a sparse graph (Y,E') with \E\ = 0(|F|) that will admit a much more 
efficient optimization process. Even though this process must be run only once per 
form and element (say the Laplacian with quadratics on triangles), the quadratic al- 
gorithm can become very time consuming and challenge a single computer's resources 
for forms of high arity using high degree polynomials on tetrahedra. 

5.1. Search algorithms. Before discussing how we may evaluate some of these 
complexity-reducing relations over the collection Y in better than quadratic time, we 
first describe some basic notation we will use for hash tables throughout this section 
and the next. 

Hash tables are standard data structures [2] that associate each member of a 
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Table 4.8 

Number of multiply- add pairs in the optimized algorithm for performing all of the contractions 
with the coefficient in the weighted Laplacian on triangles and tetrahedra first, resulting in (' P 2 +1 ) 
arrays of length = 6 to contract with (G L ) e . 



triangles 



tetrahedra 



degree 


n 


m 


nm 


MAPs 


1 


18 


3 


54 


7 


2 


63 


6 


378 


138 


3 


165 


10 


1650 


899 



degree 


n 


m 


nm 


MAPs 


1 


60 


4 


240 


9 


2 


330 


10 


3300 


465 


3 


1260 


20 


25200 


7728 



Table 4.9 

Comparing the total number of multiply-add pairs for fully forming G e , contracting with (G L ) e 
first, and contracting with Wk first on both triangles and tetrahedra. 



triangles 



tetrahedra 



G e 


degree 


MST 


additional 


total 


1 


27 


3*3 


38 


2 


218 


3*6 


236 


3 


1110 


3*10 


1140 



(G L ) e first 


degree 


MST 


additional 


total 


1 


9 


6*3 


27 


2 


115 


21*6 


241 


3 


683 


55*10 


1233 



Wk first 


degree 


MST 


additional 


total 


1 


7 


6*3 


25 


2 


138 


21*3 


201 


3 


899 


55*3 


1064 



G e 


degree 


MST 


additional 


total 


1 


108 


6*4 


132 


2 


1650 


6*10 


1710 


3 


14334 


6*20 


14454 



{G L ) e first 


degree 


MST 


additional 


total 


1 

2 
3 


27 
693 
7021 


10*4 
55*10 
210*20 


67 
1234 
11221 




Wk first 


degree 


MST 


additional 


total 


1 

2 
3 


9 
465 
7728 


10*6 
55*6 
210*6 


69 
795 
8988 



set of keys to some value, possibly drawn from another set of objects. The important 
point about hash tables is that the basic operations of setting and getting values are 
expected to be independent of the number of entries in the table (expected constant 
time access). Many higher- level programming languages have library support or built- 
in features supporting hash tables (many implementations of the standard template 
library in C++ include hash tables, and scripting languages such as Python and Perl 
have them built in as primitive types). 

We begin by establishing some notation for the basic operations we use. If a is a 
key of table T, then we find the value associated with a by T[a]. If there is no value 
associated with a (that is, if a is not a key of T, we may update T by adding a key 
a associated with value b by the notation T[a] <— b. We use the same notation to 
indicate setting a new value to an existing key. 

As before, we label the vectors y l for 1 < i < n. We want to partition the labels 
into a set of subsets £ such that for each E e £, the vectors associated with each 
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label in E are equal. Moreover, if two vectors are equal, then their labels must belong 
to the same E. This is easily accomplished by setting up a hash table whose keys 
are vectors and whose values are subsets of the integers 1 < i < n. This process is 
described in Algorithm Q] 



Algorithm 1 Determining equality among vectors 

E an empty table mapping vectors to subsets of {«}" =1 . 
for all 1 < i < n do 

if y % is a key of E then 

E[y l ] <- E[v l ] U {i} 
else if — y % is a key of E then 

E[-f] «- E[-y l ] U {i} 
else 

end if 
end for 



Floating point arithmetic presents a slight challenge to hashing. Numbers which 
are close together (within some tolerance) that should be treated as equal must be 
rounded to so that they are indeed equal. Hashing relies on a function that maps 
items into a set of integers (the "hash code"). These functions are discontinuous 
and sensitive to small perturbations. For most numerical algorithms in floating point 
arithmetic, we may define equality to be "near equality", but hash tables require 
us to round or use rational arithmetic before any comparisons are made. We have 
successfully implemented our algorithms in both cases. 

As an alternative to hashing, one could form a binary search tree or sort the 
vectors by a lexicographic ordering. These would rely on a more standard "close to 
equal" comparison operation, but only run in 0(mn log (mn)) time. So, for large 
enough data sets, hashing will be more efficient. 

We may similarly partition the labels into a set of subsets C such that for each 
C G C, the vectors associated with the labels in C are colinear. Similarly, if two 
vectors are colinear, then their labels must belong to the same C. This process may 
be performed by constructing the collection of unit vectors A, with yi = j^-jy for each 

1 < i < n for some norm j • || and using Algorithm [T] on A. 

Finding vectors that are close together in Hamming distance is more subtle. At 
worst, the cost is 0(mn 2 ), as we have to compare every entry of every pair of vectors. 
However, we may do this in expected linear time with some assumptions about Y. 
We first describe the algorithm, then state the conditions under which the algorithm 
performs in worse than linear time. 

Our vectors each have m components. We start by forming m empty hash tables. 
Each Hi will map numbers that appear in the i th position of any vector to the labels 
of vectors that have that entry in that position. This is presented in Algorithm [2J in 
which y l j denotes the j th entry of y 1 . 

This process runs in expected 0(nm) time. From these tables, we can construct 
a table that gives the Hamming distance between any two vectors, as seen in Algo- 
rithm [3] This algorithm counts down from d each time it discovers an entry that two 
vectors share. Our algorithm reflects the symmetry of the Hamming distance. 

On output, for 1 < i < j < n, if -D[i][j] has no entry, the distance between v l 
and Vj is m. Otherwise, the table contains the Hamming distance between the two 
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Algorithm 2 Mapping unique entries at each position to vectors containing them 
for all 1 < i < d do 

Hi an empty table mapping numbers to sets of vector labels from {i}™ =1 
end for 

for all 1 < i < n do 
for all 1 < j < m do 
if Uj is a key of Hj then 

else 

HAvj] :={<} 
end if 

end for 

end for 



Algorithm 3 Computing Hamming distances efficiently 
D an empty table 
for all 1 < i < n do 

D[i] an empty table 
end for 

for all 1 < i < m do 

for all a in the keys of Hi do 

for all unique combinations k,£ of labels in Hi[a] do 
a := min(fc, £), ft :— max(fc, £) 
if D[a) has a key f3 then 
D[a]\0\<-D[a]\0\-1 
else 

D[a][/3] :=m-l 
end if 
end for 
end for 
end for 



vectors. 

Regarding complexity, there is a double loop over the entries of each Hi. Hence, 
the algorithm is quadratic in the maximum number of vectors that share the same 
entry at the same location. Presumably, this is considerably less than n on most data 
sets. 

5.2. Using a sparse graph. If we create a graph (Y, E) with significantly fewer 
edges than (Y,E), we may be able to get most of the reduction in operation count 
while having a more efficient optimization process. For example, we might choose to 
put a cutoff on p, only using edges that have a large enough complexity-reduction. 
So, we can define the set of edges being 

E k = {{y\yi}:p(y\yi)<k} (5.1) 

For example, if we use Algorithms [2] and [3] to evaluate H + over all pairs from Y, 
then we are using p = H + and fc = m — 1. Also, note that our structure D encodes 
a sparse graph. That is, the vectors of Y are the nodes of the graph, the neighbors 
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of each y % are simply the keys of D[y l ], and the edge weight between some y % and 
neighbor y 3 is [j- 7 ] ■ It is not hard to move from this table into whatever data 
structure is used for graphs. 

Then, we could add colinearity or other complexity-reducing relations to this 
graph. If we use Algorithm [T] on the unit vectors to determine sets of colinear vectors, 
we can update the the graph by either adding edges or updating the edge weights for 
each pair of colinear vectors. 

If \Ek\ = 0(\Y\), then computing a minimum spanning tree will require only 
0(n log n) time rather than 0(n 2 log n). However, there is no guarantee that (Y, Ek) 
will be a connected graph. Some vectors might not have close neighbors, or else 
some subgraphs do not connect with each other. An optimized algorithm can still be 
obtained by finding the connected components of the (Y, Ek) and finding a minimum 
spanning tree for each component. Then, the total cost of the computation is m 
times the number of connected components plus the sum of the weights of each of the 
minimum spanning trees. 

6. Conclusion and ongoing work. We have developed a general optimiza- 
tion strategy for the evaluation of local stiffness matrices for finite elements. This 
is based on first formulating the computation as a sequence of tensor contractions, 
then introducing a new concept of complexity-reducing relations that allows us to set 
the optimization in a graph context. The optimization itself proceeds by computing 
a minimum spanning tree. These techniques worked very well at reducing the cost 
of evaluating finite element matrices for several forms using Lagrange elements of 
degrees one through three on triangles and tetrahedra. Finally, we discussed some ef- 
ficient algorithms for detecting equality and colinearity and for evaluating the pairwise 
Hamming distance over the entire set of tensors. 

We hope to extend our optimizations in several ways. First, we remarked that 
it may be possible to enrich the collection of tensors in such as way as to deliever a 
minimum spanning tree with lower total weight. We hope to explore such a process 
in the future. Also, we saw in [11] that frequently, some of the tensors are linear 
combinations of two or more other tensors. However, both locating and making use 
of such relations in a more formal context has been difficult. We are working on 
geometric search algorithms to locate linear dependencies efficiently. However, once 
they are located, our optimization process must occur over a hypergraph rather than 
a graph. Finding a true minimum is also much more difficult, and we are working on 
heuristics that will allow us to combine these two approaches. 

Finally, we plan to integrate our optimization strategy with FFC. While FFC 
currently generates very efficient code for evaluating variational forms, we will improve 
upon this generated code by piping the tensors through our optimization process 
before generating code to perform the contractions. This will lead to a domain- 
specific optimizing compiler for finite elements; by exploiting latent mathematical 
structure, we will automatically generate more efficient algorithms for finite elements 
than people write by hand. 
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